SWE-bench: add Slurm tests & other improvements#920
Conversation
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Walkthrough

Adds configurable retry logic and harness repo/commit options to the SWE-bench evaluation code, updates the documentation accordingly, and introduces Slurm test orchestration for Qwen3-Coder 30B with metric checks. Also updates the Slurm README and run_all.sh to include the new test.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Task as SweBenchGenerationTask
    participant C as Container
    participant Git as Git Repos
    rect rgba(230,245,255,0.5)
        note over Task: Setup repos per cfg
        Task->>Git: clone SWE-bench / agent / eval harness (cfg repos)
        Task->>Git: checkout cfg commits/tags/branches
    end
    loop For each command
        rect rgba(240,255,240,0.6)
            note over Task: Retry with randomized backoff
            Task->>C: Execute command (timeout=T)
            alt Success
                C-->>Task: Expected file(s) match pattern
            else Failure or missing output
                Task->>Task: Sleep U[min_retry_interval, max_retry_interval]
                note over Task: Retry until cfg.max_retries exhausted
            end
        end
    end
    note over Task: Proceed to evaluation using cfg harness and paths
```
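The randomized-backoff retry loop in the diagram above can be sketched roughly as follows. Note this is an illustrative sketch, not the actual code in `swebench.py`: the function name, the `output_ready` callback, and the default values are assumptions based on the diagram and config option names.

```python
import random
import subprocess
import time


def run_with_retries(cmd, output_ready, max_retries=3,
                     min_retry_interval=5.0, max_retry_interval=30.0,
                     timeout=600):
    """Run `cmd`, retrying with a uniformly random backoff until
    `output_ready()` reports the expected file(s) or retries run out."""
    for attempt in range(max_retries + 1):
        try:
            subprocess.run(cmd, shell=True, timeout=timeout, check=True)
            if output_ready():  # success: expected file(s) match the pattern
                return True
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            pass  # treat as a failed attempt and fall through to backoff
        if attempt < max_retries:
            # sleep U[min_retry_interval, max_retry_interval] as in the diagram
            time.sleep(random.uniform(min_retry_interval, max_retry_interval))
    return False
```

The `random.uniform` call matches the `U[min, max]` sleep in the diagram; a non-cryptographic generator is fine for jittering retries, which is why the S311 lint hint flagged below is safe to ignore here.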
```mermaid
sequenceDiagram
    autonumber
    participant CLI as run_test.py
    participant Prep as prepare_data
    participant Slurm as Slurm Scheduler
    participant Eval as eval (per framework)
    participant Check as check_results.py
    CLI->>Prep: prepare_data (container_formatter)
    loop For agent_framework in [openhands, swe_agent]
        CLI->>Slurm: submit Eval job (expname_prefix, model/server args)
        Slurm-->>Eval: run
        CLI->>Slurm: submit Check job (after: Eval)
        Slurm-->>Check: run (loads metrics.json, assert ranges)
    end
```
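The "submit Check job (after: Eval)" step relies on Slurm job dependencies. As a rough sketch of how such a chain could be built directly with `sbatch` (the script names are hypothetical, and nemo_skills' own `add_task`/`run_after` machinery wraps this differently):

```python
import subprocess


def build_sbatch_cmd(script, after_job_id=None):
    """Build an sbatch command line, optionally gated on another job
    finishing successfully via --dependency=afterok."""
    cmd = ["sbatch", "--parsable"]  # --parsable prints just the job id
    if after_job_id is not None:
        cmd.append(f"--dependency=afterok:{after_job_id}")
    cmd.append(script)
    return cmd


def submit(script, after_job_id=None):
    """Submit and return the new job id (requires a Slurm cluster)."""
    out = subprocess.run(build_sbatch_cmd(script, after_job_id),
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()


# On a cluster, the eval -> check chain from the diagram would look like:
# for framework in ["openhands", "swe_agent"]:
#     eval_id = submit(f"eval_{framework}.sbatch")
#     submit(f"check_{framework}.sbatch", after_job_id=eval_id)
```

With `afterok`, the check job only starts once the eval job exits successfully, which is what lets `check_results.py` assume `metrics.json` exists.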
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (2)
36-36: Clarify the comment about dependent jobs.

The comment states "automatically rerun 2 times," but `dependent_jobs=2` schedules 2 dependent jobs, resulting in 3 total runs (1 original + 2 dependents). Consider revising the comment to: "automatically schedule 2 dependent reruns (3 total runs)."
37-37: Update comment for clarity.

The comment mentions "second run (swe-agent)" but this function is invoked for both `openhands` and `swe_agent` frameworks. The described issue with reading config files from absolute cluster paths applies to both frameworks, not specifically the second framework.

Apply this diff:

```diff
- reuse_code=False,  # otherwise the second run (swe-agent) tries to read the config file from the absolute cluster path and fails
+ reuse_code=False,  # otherwise subsequent runs try to read the config file from the absolute cluster path and fail
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- docs/evaluation/code.md (3 hunks)
- nemo_skills/inference/eval/swebench.py (10 hunks)
- tests/slurm-tests/README.md (1 hunks)
- tests/slurm-tests/qwen3coder_30b_swebench/check_results.py (1 hunks)
- tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (1 hunks)
- tests/slurm-tests/run_all.sh (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/slurm-tests/qwen3coder_30b_swebench/check_results.py (2)
tests/slurm-tests/utils.py (4)
- assert_all (46-58)
- get_nested_value (24-30)
- load_json (18-21)
- soft_assert (36-43)

tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (1)
- main (45-86)
nemo_skills/inference/eval/swebench.py (1)
nemo_skills/inference/chat_interface/core.py (1)
- cfg (181-182)
tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (3)
nemo_skills/pipeline/run_cmd.py (1)
- run_cmd (40-211)

nemo_skills/pipeline/cli.py (1)
- wrap_arguments (45-54)

tests/slurm-tests/qwen3coder_30b_swebench/check_results.py (1)
- main (44-52)
🪛 Ruff (0.13.3)
tests/slurm-tests/qwen3coder_30b_swebench/check_results.py
22-22: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
nemo_skills/inference/eval/swebench.py
253-253: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
264-266: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🪛 Shellcheck (0.11.0)
tests/slurm-tests/run_all.sh
[warning] 15-15: This { is literal. Check expression (missing ;/\n?) or quote it.
(SC1083)
[warning] 15-15: This } is literal. Check expression (missing ;/\n?) or quote it.
(SC1083)
🔇 Additional comments (1)
tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (1)
85-85: To inspect how `add_task` handles `run_after`, run:

```shell
#!/bin/bash
rg -nP "def add_task" -C3 nemo_skills/pipeline
```
```python
for category_tuple, expected_range in METRIC_RANGES[agent_framework].items():
    val = float(get_nested_value(data, category_tuple))
    lo, hi = expected_range
    soft_assert(lo <= val <= hi, f"swe-bench ({agent_framework}) {category_tuple}={val} out of range [{lo},{hi}]")
```
Handle missing or non-numeric metrics without crashing
`get_nested_value` returns `None` when the metric is absent (e.g., evaluation failed), so `float(None)` raises and the script aborts before calling `assert_all()`. We should soft-assert in that case to report a clear failure instead of throwing a traceback.
```diff
-        val = float(get_nested_value(data, category_tuple))
-        lo, hi = expected_range
-        soft_assert(lo <= val <= hi, f"swe-bench ({agent_framework}) {category_tuple}={val} out of range [{lo},{hi}]")
+        raw_val = get_nested_value(data, category_tuple)
+        if raw_val is None:
+            soft_assert(False, f"swe-bench ({agent_framework}) {category_tuple} missing")
+            continue
+        try:
+            val = float(raw_val)
+        except (TypeError, ValueError):
+            soft_assert(False, f"swe-bench ({agent_framework}) {category_tuple}={raw_val} not numeric")
+            continue
+        lo, hi = expected_range
+        soft_assert(lo <= val <= hi, f"swe-bench ({agent_framework}) {category_tuple}={val} out of range [{lo},{hi}]")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for category_tuple, expected_range in METRIC_RANGES[agent_framework].items():
    raw_val = get_nested_value(data, category_tuple)
    if raw_val is None:
        soft_assert(False, f"swe-bench ({agent_framework}) {category_tuple} missing")
        continue
    try:
        val = float(raw_val)
    except (TypeError, ValueError):
        soft_assert(False, f"swe-bench ({agent_framework}) {category_tuple}={raw_val} not numeric")
        continue
    lo, hi = expected_range
    soft_assert(lo <= val <= hi,
                f"swe-bench ({agent_framework}) {category_tuple}={val} out of range [{lo},{hi}]")
```
doesn't really matter here, we're only checking one metric value, so it can only fail once regardless.
also we don't bother with this in the other check_results.py scripts...
@ludwig-n, that's fair. Given there's only one metric being checked and the existing pattern in other check_results.py scripts, the additional error handling isn't necessary. I appreciate the clarification!
```python
benchmarks="swe-bench",
num_chunks=8,
dependent_jobs=2,  # automatically rerun 2 times because it's common for some instances to fail
reuse_code=False,  # otherwise the second run (swe-agent) tries to read the config file from the absolute cluster path and fails
```
Could you share the error, please? This shouldn't be happening.
Merging for now, but we should address this in another PR.
Signed-off-by: dgitman <dgitman@nvidia.com>
This PR adds Slurm tests for SWE-bench. Specifically, the Qwen3-Coder 30B model is evaluated with OpenHands and SWE-agent.
It also adds the following features & fixes:
Summary by CodeRabbit

- New Features
- Documentation
- Tests